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Abstract 

The computation of p-values in goodness-of-fit or likelihood ratio tests is often 
done with simple analytic formulae, even if the validity of these formulae is far 
from obvious. In particular, no simple formula exists for the comparison of models 
that are not nested, in the sense that one model can be obtained from the other 
by fixing some of its parameters. In this paper I present a strategy for efficient 
numerical computations of p-values, which works for both nested and non-nested 
models and does not rely on additional approximations. These ideas have been 
implemented in a publicly available C++ framework for maximum likelihood fits 
called my Fitter and have recently been applied in a global analysis of the Standard 
Model with a fourth generation of fermions. 
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1 Introduction 



Even though the LHC experiments have, so far, not found any clear signs for physics 
beyond the Standard Model (SM) they already put strong constraints on the favourite 
SM extensions of many theorists. The SM with a (perturbative) sequential fourth 
generation of fermions (SM4) is under a lot of pressure from direct Higgs searches and 
may be the first to be excluded. Other models with additional fermions or even some 
constrained versions of Supersymmetry may follow soon. 

In this situation some thoughts should be spent on the methods and criteria by which 
we decide if a certain model is ruled out. Two well-established techniques in (frequentist) 
statistical analyses are goodness- of-fit and likelihood ratio tests. (For an introduction see 
e.g. [ ] or the statistics chapter of [2].) In the former case, the minimal value of the x 2 
function is used as a test statistic, while in the latter case one uses the difference A% 2 of 
the minimum x 2 -values of two models as a test statistic. In many cases, the probability 
density function (PDF) of the test statistic is, to a good approximation, given by the 
well-known ^-distribution [ ]. In this case the relation between the minimal x 2 value of 
a model or the A% 2 value of two models and the statistical significance (p-value) of the 
corresponding hypothesis tests is described by the normalised lower incomplete gamma 
function. 

There are, however, also many realistic scenarios where the PDF of the test statistic 
is not described by a ^-distribution. One example is the case of likelihood ratio tests 
where the two models to be compared are not nested, meaning that one model can not be 
obtained from the other by fixing some of its parameters. This problem was encountered 
in a recent analysis of the Standard Model (SM) with a fourth generation of fermions 
[4]. In this analysis it is impossible to regard the SM with three fermion generations 
as a limiting case of the SM with four generations due to non-decoupling contributions 
of chiral fermions in electroweak precision observables and Higgs production and decay 
rates. Another case where analytical formulae for p- values are not reliable is the situation 
where some of the parameters of a model are bounded, in the sense that they are only 
allowed to float within a certain range. Most notably, this applies to analyses where 
systematic errors are treated within the i?Fit scheme [5], i.e. by introducing so-called 
nuisance parameters with a limited range. 

When analytic formulae fail one has to resort to numerical methods, and the com- 
putation of p-values is no exception. The basic strategy is to generate a large sample of 
random toy measurements, whose distribution is given by the experimental errors of the 
observables and which is centred on the model's prediction. For each toy measurement 
the value of the test statistic is computed and compared to the value obtained from the 
actual data. With a large enough sample we can then estimate the probability that the 
value of the test statistic is more extreme than the observed value, and this probability 
is called statistical significance or p-value. Unfortunately, the computational cost of 
the required numerical simulations can be rather high, especially when the p-value is 
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small. In this paper I discuss some methods for improving the efficiency of numerical 
computations of p- values. These methods have been applied in [6], where, based on 
the constraints from Higgs searches and electroweak precision observables, likelihood 
ratio tests comparing the SM with three and four fermion generations were performed. 
The methods are also implemented in a publicly available code called my Fitter, which 
I present in this paper. 

The paper is organised as follows: in Sec. 2 I describe the general mathematical 
setup and the definitions of the different types of p-values. In Sec. 3 I go trough the 
derivation of the analytical formulae for p- values and discuss their range of applicability. 
In Sec. 4 I explain the strategy for improving the efficiency of numerical computations 
of p-values. The m^Fitter code and the implementation of the methods from Sec. 4 is 
presented in Sec. 5. I conclude in Sec. 6. 



2 General Setup 

To fix some notations let me first introduce the general mathematical setup for goodness- 
of-fit and likelihood ratio tests. The discussion essentially follows the statistics review 
in [2] , but puts a stronger emphasis on the geometric interpretation of the log-likelihood 
function. This geometric picture will be needed in the next sections. 

Let Q = (Qi)i=i,...,n be a vector of observables whose experimentally measured values 
may lie in a sample space S C R n . Assume that the true values of these observables 
are described by a vector Q £ S. The results Qi of the measurements may then be 
regarded as n (generally not independent) random variables distributed according to a 
(joint) probability density function (PDF) 7Ta(Q). The log-likelihood function L(Q, Q) 
is defined as 

L(Q,Q) = -21n7TQ(Q) . (1) 
In this setup, L is required to have the following properties: 

1. For any given Q 6 S, the log-likelihood function L must satisfy the normalisation 
condition 

/ cTQe-^(Q'Q) = 1 . (2) 
Js 

2. For any given Q £ S, and considered as a function of Q, the log-likelihood function 
L(Q, Q) must be bounded from below and have its unique absolute minimum at 
Q = Q. 

3. For any given Q G S, and considered as a function of Q, the log-likelihood function 
L(Q, Q) must be bounded from below and have its unique absolute minimum at 
Q = Q. 
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The first property is simply the normalisation condition for the PDF tTq(Q). The 
second property allows us to regard L(®, Q) as a measure of disagreement between a 
theoretical prediction G S and a measured set of observables Q G S. This property 
justifies the method of maximum likelihood fits, where theory parameters are determined 
by minimising L. 

The third property reflects the fact that the measurements of the Qi really measure 
the quantities Qi. Without any modifications to the model, it does not hold in the 
presence of systematic errors, since a systematic error is an offset between the true 
value of an observable and its most likely measured value. This offset is the same 
each time the measurement is performed and does therefore not average out when the 
measurement is repeated many times. This results in a difference between Qi and 
the maximum of the distribution of the random variable Qi. The central idea of the 
RYit method [5] is that systematic errors should not be treated as errors at all, but 
as unknown theory parameters, so-called nuisance parameters, that may vary within a 
certain range. In a way, the presence of a systematic error means that theorists and 
experimentalists are simply not talking about the same quantity. Since the difference 
between the two quantities can neither be modeled nor measured it has to be treated as 
an additional model parameter, but with a limited range of possible values. Thus, the 
third assumption does hold if systematic errors are treated within the i?Fit scheme, i.e. 
by introducing a nuisance parameter for each source of systematic errors. 

In this generic setting, a theory is simply a function © that maps parameters x = 
(xj)j = i v .. 5 fc from some parameter space X C R fe into 5*. In the absence of systematic 
errors k should be smaller than the number n of observables if one wants to constrain the 
parameter space X. However, if all observables are affected by systematic errors, and an 
independent nuisance parameter is introduced in each case, the number k of parameters 
may indeed be larger than n. Of course, the ranges of the nuisance parameters will be 
limited. In any case we define the theory manifold M as 

M = {©(x)|x G X} C S . (3) 

To quantify the disagreement between the theory and some vector Q G S of measured 
observables we define the squared distance function d 2 by 

d 2 (Q) = min [L(Q\ Q) - L(Q, Q)] = min[L(0(x), Q) - L(Q, Q)] G R . (4) 

Q'eM xGX 

The definition of d 2 (Q) is the usual definition of the squared distance between the 
manifold M and the point Q if the squared distance between two points in S is measured 
by L. Note that I will use the term "squared distance" for the values L(Q, Q) even 
though, in this general setting, a/Z is not guaranteed to have the usual properties of a 
distance measure, such as positivity or the triangle inequality. However, these properties 
are not needed for our discussion. All we require for our definitions to make sense are 
the properties 1 to 3 from the start of this section. In the case of Gaussian errors we 
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will see that L(Q, Q) is (up to a constant) indeed just the squared euclidean distance 
between Q and Q. 

Except for very contrived and pathological cases, there will only be one element of 
M with minimal distance to Q. We denote these best-fit observables as Q(Q) G S, so 
that 

L(Q(Q),Q)=d 2 (Q) . (5) 
The domain X(Q) C X of best-fit parameters is then 

X(Q) = e- 1 (Q(Q)) = {xGX|e(x) = Q(Q)} . (6) 

In the absence of systematic errors the set -A(Q) will usually contain only one element, 
namely, the best-fit parameter vector x(Q). However, if many nuisance parameters 
are introduced because of systematic errors there may be ambiguities. The definitions 
below rely on Q(Q) being unique, but not on x(Q) being unique. Note that, in order 
to compute <i 2 (Q) and x(Q) one has to go through the usual procedure of minimising 
the chi-square function, which, for a given vector of measured observables Q, is defined 
as 

X 2 (x) = L(0(x),Q)-L(Q,Q) . (7) 

The best-fit parameters x(Q) are then just the parameters that minimise the \ 2 function 
and gP(Q) is the minimal x 2 - value. 

With these preparations we are now ready to write down the formula for the p- value 
p(Q) of a goodness-of-fit test of the theory 0, assuming measured values Q of the 
observables: 

p(Q) = |rf"Q , 7 TQ (Q) (Q / )^ 2 (Q / )-rf 2 (Q)) , (8) 

where 9 is the Heavyside step-function. Eq. 8 defines p(Q) as the probability that a 
random observable-vector Q' distributed according to 7Tq(q) disagrees more with the 
theory © than the observable-vector Q which was actually measured. By using 7Tq(q) 

as PDF for Q' we assume that the true observables are the best-fit observables Q(Q), 
i.e. that the theory is realised with parameters from the best-fit domain X(Q). This is 
our null hypothesis. 

To compare the performance of two different models one might be tempted to simply 
perform two goodness-of-fit tests and compare the resulting p- values. However, it is easy 
to construct an example where this strategy fails: Assume that we have 100 observables 
in our fit and model A (at it's best-fit point) describes the first 99 very well, but is in 
strong disagreement with the last observable. Model B, on the other hand, describes the 
first 99 observables as well as model A and also nicely agrees with the last observable. 
Then model B clearly performs better than model A, but the p- values of the two models 
would be essentially the same due to the large number of contributions to the log- 
likelihood function. 



4 



A broadly accepted strategy for comparing the performance of two models are like- 
lihood ratio tests. This type of hypothesis test is used when one is not interested in 
the agreement of the theory as a whole with certain data, but rather in the agreement 
of some restricted version of a theory, assuming that the full theory is correct and de- 
scribes the data. To this end, let us split up the parameter vector x of the theory into 
a relevant part a and an irrelevant part y, i.e. write x = (a, y) e A x Y = X, with the 
understanding that we are interested in the values of the parameters a but not in the 
values of the parameters y. For a fixed value of a we now define the restricted versions 
©a, M a , d a and Q a of the theory function, theory manifold, squared distance function, 
and best-fit observables as follows: 

a (y) = 0(a,y) , M a = {© a (y)|y EY} c M , 

d 2 a (Q) = min L(Q', Q) , L(Q a (Q),Q) = d a (Q) . (9) 
Q'eM a 

When comparing the restricted theory with the full theory one uses the difference of 
X 2 -values of the restricted and the full theory as a test-statistic. To that end we denote 
the difference of squared distance functions as 

A*(Q) = dl(Q) - d\Q) . (10) 

Note that A a is just a difference of minimal chi-square values for the measurement 
Q, since d a (Q) and <i 2 (Q) are the minimal chi-square values of the restricted and the 
full theory, respectively. With these notations, the p-value p a (Q) f° r fixed relevant 
parameters a and assuming measured values Q of the observables is defined as 

P.(Q) = J cTQ'7TQ a(Q) (Q'MA a (Q') - A a (Q)) . (11) 

It is the probability that some random set of measurements Q' distributed according 
to 7Tq /qv leads to a larger increase in disagreement than the actually measured point 
Q when going from the full to the restricted theory. By using 7Tq ^ as PDF for Q' 
we assume that the true observables are the best-fit observables of the restricted theory 
© a (our null hypothesis). Note that the p- value p a is equal to 1 if the parameters a 
are equal to their best-fit values for the full theory. In this sense, likelihood ratio tests 
measure the performance of the restricted model in relation to the full model: if the 
restricted model describes the data as well as the full model, the likelihood ratio test 
gives a p- value of 1. 

In the situation discussed above one also speaks of nested models: the restricted the- 
ory can be obtained from the full theory by fixing some of its parameters. Geometrically 
this means that the manifold M a of the restricted theory is a subset of the full theory's 
manifold M. Generalisations of likelihood ratio tests for non-nested models have been 
discussed, for example, in [7; ]. From the geometric point of view, the generalisation is 
rather straightforward, since there is a canonical way of combining two theories into one. 
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Let ©i and 2 be two theories with unrelated parameter spaces of (possibly) different 
dimension and let Mi and M 2 be the corresponding theory manifolds. The manifold of 
the combined theory is then simply the union of the two theory manifolds: 

M = M 1 UM 2 Mi, M 2 C M . (12) 

The definition of the p-value in likelihood ratio tests of nested models only depends 
on the manifolds of the models, but not on their parametrisation. By using either M\ 
or M2 as the restricted manifold (M a ) and M as the manifold of the full theory, the 
definition (11) can be used without modification. The squared distance function of the 
full theory is given by 

d 2 (Q)=mm{dj(Q),dj(Q)} , (13) 

where d\ and d\ are the squared distance functions of the theories ©i and ©2, respec- 
tively. The differences A^ and A2 of squared distance functions for restrictions to the 
theories ©1 or @ 2 , respectively, are 

A?(Q) = max{d?(Q)-^(Q),0} , A*(Q) = max{4(Q) - d?(Q), 0} . (14) 

For the theory 0, with the smaller \ 2 value the function Af is and the p-value is 
therefore always 1. This is analogous to the case of nested models, where the p- value at 
the best-fit point of the full theory is always 1. 



3 Analytical Formulae for p- values 

Now that we have agreed on the definitions of p-values in goodness-of-fit and likelihood 
ratio tests, let us try to compute them. If all else fails the p-values defined in the last 
section can be computed with Monte Carlo methods. This has already been outlined in 
[5]. In practice, these Monte Carlo simulations can be very time consuming. However, 
under certain assumptions it is possible to derive simple (and well-known) analytical 
formulae for the p- values [3] . To understand the range of applicability of these formulae 
it is best to re-derive them. The derivations I present here use the geometric picture es- 
tablished in the last section. The derivations will also be instructive for the next section, 
where I discuss the possibilities for improving the efficiency of numerical computations 
of p-values. 

Consider a set of observables O = (0i)i=i,...,n distributed about their best-fit values 
Oi according to an n-dimensional Gaussian distribution with a variance matrix V. In a 
first step we choose suitable linear combinations of the Oi to define a set of independent 
(i.e. uncorrelated) normalised observables Qi which are distributed according to normal 
distributions (i.e. with expectation values of zero and standard deviations of one). To 
this end, let U be an orthogonal matrix that diagonalises V: 

UVU T = di&g{<j 2 1 ,...,* 2 n ) . (15) 
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Then we define 

1 - 

Q< = -T, u *(°* - °a ■ ^ 

The PDF of the observables Qi and the corresponding log-likelihood function are then 
given by 

noiQ) = (2^ e ~ IIQ ' 2 ^ £(0,Q) = nln(27r) + |Q| 2 . (17) 

So we see that, up to a constant term, the log-likelihood function is just the squared 
euclidean length of the normalised observable- vectors Q. The constant drops out in the 
definition of the squared distance function d 2 and the x 2 function. 

As explained in the last section, the theory defines some manifold M in the space of 
normalised observables Q. If M is sufficiently flat near the origin and extends sufficiently 
far away from it we can approximate M in the vicinity of (the best-fit point) as a 
hyperplane. The flatness requirement will usually be valid if the errors o~i of the original 
observables Oj are small, because in the transition to normalised observables the theory 
manifold is stretched by factors The assumption of large extension is valid as long 
as the theory function is sufficiently regular in the vicinity of the best-fit domain X(Q) 
and if X(Q) is not close to any borders the parameter space X might have. In the 
presence of (small) systematic errors the extra parameters introduced to describe them 
have only small allowed ranges, so that all points in X lie close to the border. Thus the 
simple expressions for p-values derived in the following paragraphs should not be used 
in the presence of systematic errors. In addition to that, extreme values for the best-fit 
parameters or a singular behaviour of the theory function near the best-fit point also 
indicate a failure of these expressions. 

Note that approximating the theory manifold as flat in the vicinity of Q is not the 
same as taking the theory function © to be linear in its arguments. We just neglect the 
curvature of the manifold M in the vicinity of 0, but the map from X to the hyperplane 
M may still be non-linear. 

If the theory manifold M is a hyperplane of dimension k < n trough the origin, 
we may write any observable-vector Q' as Q' = Qji + Q' ± with Qj, £ M and Q' ± 
perpendicular to it. The squared distance <i 2 (Q') between M and Q' is then simply the 
squared length of Q' ± : 

d 2 (Q') = |Q'J 2 • (18) 

For a two-dimensional sample space and a one-dimensional theory manifold, this is 
depicted in Fig 1. The 9 function in (8) vanishes in the region between the thin dashed 
lines. Eq. 8 now simplifies to 

KQ) = / d^e-^J r- k Q' ± e-hm 2 9(\Q'^-d 2 (Q)) 

= l-P(n- fc )/ 2 (^ 2 (Q)) , (19) 



7 



Ql 



Qji Q' 



M 



Q 

Figure 1: Orthogonal decomposition of the sample space in a goodness-of-fit test. The blue 
colour indicates the probability density for the toy observable vector Q'. The 9 function in 
(8) vanishes in the region between the thin dashed lines. 

where 

P a (x) = -^— f dtt^e- 1 (20) 
1 \ a ) Jo 

is the normalised lower incomplete Gamma function. Note that for a fixed experimental 
input Q the quantity <i 2 (Q) is usually denoted as xLm as ^ i s the absolute minimum of 
the chi-square function (7). 

A formula for p-values in likelihood ratio tests of nested models can be derived in 
a similar way. Keeping a subset a = (oi)»=i r of theory parameters fixed defines a 
(k — r)-dimensional sub-manifold M a of the theory manifold M. Now we shift the best- 
fit point Q a (Q) of the restricted theory to the origin. In the vicinity of the origin, we 
again approximate M as a hyperplane and M a as a linear subspace of M. We decompose 
the vector Q' into three orthogonal components 

Q' = Q; + Q' 2 + Q' 3 (21) 

with G M a and + Q' 2 G M. The function A^(Q') in (11) is then 

Aa(Q') = dl(Q') - rf 2 (Q') = |Q 2 + Q' 3 | 2 - |Q' 3 | 2 = |Q' 2 | 2 . (22) 

For a three-dimensional sample space, a two-dimensional theory manifold M and a 
one-dimensional theory manifold M a this is depicted in Fig. 2. The 9 function in (11) 
vanishes in the region between the planes indicated by the thin dashed lines. Thus, 
Eq. 11 reads 

Pa(Q) = j^j- 2 J d^Q[e-^ 2 J d- k %e^ 2 j tTQae'sWedQ^I 2 - A a (Q)) 
= l- J R r/2 (iA 2 (Q)) . (23) 

For a fixed experimental input Q, the quantity A 2 (Q) is usually denoted as A% 2 (a), 
since it is the difference of minimal chi-square values in the restricted and the full theory. 

We see that the derivations of the analytical formulae (19) and (23) rely on the 
following assumptions: 



8 



Figure 2: Orthogonal decomposition of the sample space for a likelihood ratio test of nested 
models. The blue colour indicates the probability density for the toy observable vector Q'. 
The 6 function in (11) vanishes in the region between the planes indicated by the thin dashed 
lines. 

1. The experimental errors on all observables are Gaussian (with or without correla- 
tions) . 

2. The theory manifold (in the space of normalised observables) can be approximated 

hyperplane. This approximation is valid if 

(a) the curvature of the theory manifold (in the space of normalised observables) 
can be neglected and 

(b) the best fit point (s) is (are) far away from any borders the theory manifold(s) 
might have. 

Assumption 2(a) is valid if the errors of the observables are small enough. Assumption 
2(b) fails if the mapping between the parameter space and the theory manifold is singular 
in the vicinity of the best-fit point or if any of the best-fit parameters are close to their 
upper or lower limit. In particular, assumption 2(b) is invalid in the case of (small) 
systematic errors, if these errors are treated within the i?Fit scheme. 

In the case of non-nested models, as discussed at the end of the last section, the the- 
ory manifold can at best be approximated as a union of two hyperplanes with (possibly) 
different dimensionality. In this case Eq. 23 is clearly not applicable. 

4 Numerical Calculation of p- values 

In the last section we have seen that there are many realistic scenarios where the simple 
analytic formulae (19) and (23) are not valid. In this case on has to resort to numerical 
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integration methods to calculate p-values. Based on the geometric view established in 
the last two sections I will now discuss some methods for improving the efficiency of 
these integrations. I will focus on the computation of p- values in likelihood ratio test, 
both for nested and non-nested models. The basic idea behind these methods is to 
optimise the numerical integration for the case where the errors on the observables are 
Gaussian and the manifolds of the two models (nested or non-nested) do indeed form 
hyperplanes in the space of normalised observables. The important difference is that 
these integration methods do not rely on the assumptions listed at the end of the last 
section. They only work best if they are valid. 

Consider a set of observables Oj (i = l,...,n) whose distribution about their true 
values O is described by a log-likelihood function L(0,0). By definition, the log- 
likelihood function, for fixed O and considered as a function of O, has its absolute 
minimum at O = O. We write the second-order Taylor-expansion of L about it's 
minimum as 

l(6,o) = l(6,6) + ^(o-6) t v- 1 (o-6) + o(|o-6| 3 ) . (24) 

In this approximation, the PDF 7Tq(0) is always a Gaussian distribution with variance 
matrix V. Thus, for an arbitrary log-likelihood function, we may define the normalised 
observables Q analogous to Sec. 3, where V is now the Hessian matrix of the likelihood 
function at its minimum. The image of O in the space of normalised observables is 
and the log-likelihood function in terms of the normalised observables is (up to an 
irrelevant constant) 

L(0,Q) = |Q| 2 + 0(|Q| 3 ) . (25) 

Now assume that the observables O are a possible prediction of some theory 0. Let 
M be the manifold of the theory in the space of normalised observables. This means that 
M contains the origin. In the vicinity of the origin we may now approximate the theory 
manifold M by its tangent hyperplane H at the origin. Numerically, the hyperplane 
H can be constructed from the derivatives of the theory function with respect to all 
its parameters. Since H contains the origin, it is a subspace of R™. Similarly, we may 
consider some restriction a of the theory 0, whose image also contains the observable 
vector O. In the space of normalised observables this theory defines another subspace 
H a C H. All this is fairly similar to the case discussed in Sec. 3. There, however, the 
quadratic approximation (24) for the log-likelihood function was assumed to be exact 
and the theory manifolds M and M a were identical to the tangent hyperplanes H and 

Now recall the expression (11) for the p-value in a likelihood ratio test: 

p a (Q) = J eTQ' ttoCQ'MA'CQ') - A 2 (Q)) , (26) 

where 

^(Q) = 7^^xp[-i|Q| 2 + 0(|Q| 3 )] (27 ) 
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is the PDF in the space of normalised observables. To calculate the integral (26) nu- 
merically via Monte-Carlo simulation, one generates N sample points Q^, which are 
randomly distributed according to some PDF p(Q')- The integral is then estimated by 

^)^E^( A a(Q;)-^(Q)) • (28) 

To reduce the statistical error of this estimate on has to choose the function p as similar 
as possible to the integrand, so that the terms in the sum are (ideally) all of the same 
size. 

The first, obvious thing to do is to choose p proportional to e - ^'^ ' 2 , so that this 
factor cancels between p and 7Tq. This means that the sample points Q'j should follow an 
n-dimensional normal distribution. To further improve the estimate, we have to inspect 
the argument of the 9 function. Note that A^(Q) does not depend on the integration 
variable, and A^(Q') = d a (Q') —d 2 (Q'). Up to terms of order |Q'| 3 the squared distance 
functions can be calculated by projecting the vector Q' on the orthogonal complements 
H a and H 1 - of the subspaces H a and H, respectively. Just as in the derivation of (23), 
we write Q' as the sum of three orthogonal vectors Q' 1? Q' 2 and Q' 3 with Q' x G H a , 
Q; + Q' 2 G H and Q' 3 G H ± . Then 

A^(Q / ) = IQ 2 + Q / 3l 2 -IQ / 3l 2 = IQ 2 r • (29) 

Thus, if the theory manifolds were identical to the hyperplanes H a and H, the integrand 
in (26) would be zero in the inner region defined by |Q 2 | 2 < A^(Q) and we should not 
waste time by throwing sample points in that region. Since the theory manifolds are 
not actually hyperplanes we should be a bit more careful and also put a few sample 
points in the inner region. There are many ways to do this. My suggestion is to define 
p as 

0(o >) - e -|IQi+Qil 2 f a W a > IQ' 2 I 2 <A 2 (Q) 

P(Q) , |Q' 2 | 2 >Ai(Q) ' (30) 

The parameters and a,b,a > may be tuned to improve the efficiency of the nu- 
merical integration (subject, of course, to the constraint that the PDF p is properly 
normalised). The problem of actually generating points according to this distribution 
will be addressed in Sec. 5. 

Now consider the case of non-nested models @i and ©2- Let us assume that ©i 
has the smaller \ 2 value and we want to compute the p-value of ©2, assuming that 
one of the two models is correct. To do this, we assume that @ 2 is realised with its 
best-fit parameters. This is our null hypothesis. In the space of normalised observables 
Q the best-fit point of ©2 is again the origin, while the best-fit point of ©i is some 
other vector Q 1 . We may again approximate the two theory manifolds by their tangent 
hyperplanes Hi and H2 at the respective best-fit points. Thus, H2 contains the origin 
and is a subspace of R n . Hi, on the other hand, contains Q x but not necessarily the 
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Figure 3: Orthogonal decomposition of a three-dimensional sample space for non-nested 
models. The tangent hyperplane H\ of model 0i is one- dimensional and the tangent hyper- 
plane H 2 of model ©2 is two-dimensional. The thin dashed line is the projection of H\ onto 
H 2 . The blue colour indicates the probability density for the toy observable vector Q'. 



origin, so it is not a subspace. This makes an estimation of the support of the 9 function 
more difficult. Let H 2 i C H 2 be the subspace obtained by shifting Hi by — Qi (so that it 
contains the origin) and projecting it onto H 2 . Furthermore, let H 22 be the orthogonal 
complement of i^2i in H 2 . The projection of Hi onto H 22 is then a single point C, which 
can be obtained by projecting Q 1 onto H 22 . Any vector Q' can now be written as the 
sum of three orthogonal component vectors Q' 1; Q' 2 and Q3 with Q' x G H 2 i, Q' 2 G H 22 
and Q3 G H^. The distance between Q' and Hi is larger than |C — Q' 2 | since the 
projection of any vector pointing from Q' to Hi onto the subspace if 22 is C — Q' 2 . For 
the special case, where H 2 i, H 22 and if^ are all one-dimensional, these definitions are 
summarised in Fig. 3. 

Now let us assume for a moment that the theory manifolds are identical to the 
hyperplanes Hi and H 2 . For the squared distances c?i(Q') and rf 2 (Q') we then have 

rfl(Q') = IQ3I 2 , ^(QO > |C-Q' 2 | 2 . (31) 

Thus, the 9 function 0(A|(Q') - A|(Q)) vanishes for 

|Q 3 | 2 <A^(Q) + |C-Q 2 | 2 . (32) 



Note, however, that contrary to the case of nested models the 9 function may also vanish 
outside this region, even if the approximation of the theory manifolds by hyperplanes 
is exact. Nonetheless, a good strategy for improving the efficiency of the integration is 
to construct a PDF p(Q') which "avoids" this region: 
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Note that the relation between a, b and a, which assures that p is properly normalised 
now also depends on |Q' 2 |. But there are still two free parameters which can be tuned to 
improve the efficiency of the numerical integration. To generate an ensemble of points 
distributed according to p one should first fix the components Q[ and Q' 2 by generating 
an equivalent number of normally distributed random variables. The component Q' 3 
can then be generated with the steps described in Sec. 5. 

Both, for the case of nested and non-nested models, the efficiency of the integration 
can be improved further by adaptive integration techniques, where the shape of the 
sampling density p is tuned during the actual integration. For the adaptation, the 
implementation in my Fitter uses the OmniComp/Dvegas package [9] by Nikolas Kauer, 
which implements the VEGAS algorithm [10] and was developed in the context of [11; 
12]. Thanks to OmniComp, parallelised integration is fully supported. 

5 Introducing rra/Fitter 

The ideas for the numerical computation of p-values outlined in the last section have 
been implemented in a publicly available code called rm/Fitter. The source code is 
available at Hepforge [13]. Detailed documentation will be included in the source distri- 
bution. Here I just want to provide a brief description of the application programming 
interface (API) and discuss some details of the implementation. 

myFitter is a C++ class library and makes extensive use of inheritance and poly- 
morphism to separate the tasks of fitting a model to experimental data and computing 
p-values from the tasks of implementing the observables (as functions of the model's 
parameters) or the log-likelihood function (as a function of the observables). The main 
classes the user will have to deal with are: 

Model This is the base class for all models implemented by the user. It essentially 
represents the theory function © from earlier sections, i.e. the map from the 
model's parameter space to the space of observables (sample space). The base 
class provides functionality for storing "current" values of parameters, observables 
and derivatives of observables with respect to the parameters, setting ranges in 
which parameters are allowed to float or fixing them (so that they do not float at 
all). It can also randomly sample the parameter space and build up a dictionary 
of parameter values and the corresponding observable values. This dictionary can 
be used to find good starting points for numerical minimisations of x 2 functions. 
To implement their own model, the user has to subclass Model and overload the 
method calc() which computes the observables based on the current values of the 
parameters. They may also overload the method calc_deriv() , which calculates 
the derivatives of all observables with respect to all parameters. The default 
implementation uses simple numerical differentiation. 
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LikelihoodComponent This is the base class for objects that represent terms in the 
log-likelihood function L (see Sec. 2). Each likelihood component represents 
the contribution from one or more observables Oi to the log-likelihood func- 
tion. To calculate the value of the log-likelihood function, the contributions 
of all LikelihoodComponent objects are added up. This is done by another 
class, LikelihoodFunction, which acts as a container for LikelihoodComponent 
objects. Derived classes of LikelihoodComponent must overload the method 
calc(0,0), which takes two vectors as arguments (the first being the "pre- 
dicted" values of the observables and the second being the "measured" ones) 
and returns the contribution of the term to the log-likelihood function. Addition- 
ally, the methods calc_deriv() and get_hessian() must be implemented, which 
calculate the derivatives with respect to the Oj and the Hessian matrix for the min- 
imum at O = O. Implementations for the most common likelihood components 
are also available. These classes are: Gauss ianLC (for single observables with 
a Gaussian and possibly systematic errors), AsymmetricGauss ianLC (for single 
observables with asymmetric Gaussian error bars and possibly systematic errors) 
and CorrelatedGaussianLC (for several observables with Gaussian errors and a 
correlation matrix). 

Fitter Objects of this type are responsible for fitting the parameters of mod- 
els (represented by Model objects) to experimental data (represented by a 
LikelihoodFunction object) and for computing p-values by numerical integra- 
tion. Each Fitter object contains a LikelihoodFunction object which is ac- 
cessible through the likelihood_f unction () method and must be "filled" with 
LikelihoodComponent objects before any fits can be done. Once the likeli- 
hood function is initialised, fits can be performed with the local_fit() and 
global_f it () methods. As arguments, these methods take the Model object to be 
fitted and (optionally) a vector of central values for the observables. If no central 
values are given, the defaults from the likelihood_f unctionO are used. The dif- 
ference between these methods is that local_f it() uses the current values of the 
model parameters as starting point for the \ 2 minimisation, while global_fit() 
uses the dictionary created by a previous call to the model's scan() method. The 
p-values for likelihood ratio tests of nested and non-nested models can be cal- 
culated with the methods calc_nested_lrt_pvalue () and calc_lrt_pvalue(), 
respectively. As arguments, these two methods take the models to be compared. 
Note that, for calc_nested_lrt_pvale() to work, the second model must be a 
restricted version of the first, i.e. a copy of the first object with some additional 
parameters fixed. In addition to these methods, the Fitter class contains nu- 
merous options and flags that control the accuracy and various other aspects of 
the minimisation and integration routines. These options will be described in the 
package documentation. Most notably, the p-value integrations can be parallelised 
without additional programming efforts by the user. 
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To minimise the \ 2 function, rm/Fitter currently uses the multidimensional minimi- 
sation algorithms from the GNU scientific library (GSL). The GSL provides several algo- 
rithms and the user can choose their favourite one with a call to Fitter : :minimizer () . 
While these minimisation algorithms are sophisticated enough for most purposes, the 
possibilities for determining if a minimisation has converged are not. Currently, 
rm/Fitter just aborts a minimisation when the length of the gradient is smaller than 
a number provided by the user. This is acceptable if the user also provides a good 
estimate of the width of the x 2 minimum in the direction of each parameter. (These 
can be set with the Model: :scale() method.) Other codes such as Minuit [14] use 
more sophisticated convergence tests and should be interfaced with m^Fitter in future 
releases. 1 

The problem of minimising a function of bounded parameters (i.e. of parameters that 
have an upper or lower limit) is solved in the usual way by smoothly and invertably 
mapping the real axis R to the allowed range of the parameter. Internally, myFitter 
does this with the function 



However, since this way the region close to a parameter's upper or lower bound is 
stretched out to infinity, minimisation algorithms will often struggle to converge on a 
minimum that is close to or even on a parameter's upper or lower bound. Normally, some 
intervention by the user is required in this case. But when this happens during a p- value 
integration rm/Fitter can not stop each time and ask the user what to do. It therefore 
uses a simple "snapping" rule: Whenever, during a \ 2 minimisation, a parameter gets 
too close to its upper or lower limit (the exact distance can be configured by the user), 
m^Fitter fixes that parameter at its limit and restarts the minimisation with one free 
parameter less. 

Finally, let me explain the implementation of the integrand mappings from Sec. 4 in 
a bit more detail. To this end, note that the PDFs from Eq. 30 and 33 both decompose 
into three factors pi(Q'i)p2(Q2)P3(Q3) where two factors are always proportional to an 
nj- dimensional normal distribution (rij being the dimension of the subspace that lives 
in). Generating vectors distributed according to an n^-dimensional normal distribution 
is simple, because each component of the vector is then distributed according to a 
one-dimensional normal distribution. 

The remaining problem is to generate an n-dimensional random vector Q distributed 
according to the PDF 



^n fact, I initially used Minuit2 [15] (the C++ version of Minuit) for minimisation, but the imple- 
mentation proved unstable and would, in some cases, inexplicably start calling the % 2 function with 
NANs as parameter values. 




1 



(34) 




Q| 2 < A 2 
Q| 2 > A 2 



(35) 
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for some A > 0. We first note that the PDF p is rotationally invariant. The length R 
of the vector Q is then distributed according to a PDF 



2W 2 I aR a R 2 < A 2 

HK ! r(n/2) [fe-l^ 2 , R 2 > A 2 K J 

We may write this as 

p(R) = fP<(R) + a-f)~P>(R) (37) 
where / G [0, 1] is a free parameter and 

MK)=^w(A-*) , feW = 2 ,..- 2 ^;:;^ (fl f r / g A2)) m 

are PDFs normalised to 1. Here, P n /2 is the normalised lower incomplete Gamma 
function (20). Since p < and p> are normalised, the parameter / is just the fraction 
of sample points that will be put in the inner region with R < A. For a given /, the 
corresponding values of a and b are 

= fT{n/2){n + a) = 1-f 

27T n/2 A n+a ' (2tt)"/ 2 (1 - P n/2 (§A 2 )) ' 1 ' 

By integrating p p> and p from to R we obtain the cumulative distribution functions 
(CDFs) 

CDF^)^ , CDF, >( ^) = ^^-^^ 



l-P n/2 {\A 2 ) 

f CDF^ (R) , R<A 

f + (l-f)CDF^(R) , P>A 

To generate random variables R distributed according to p we need the inverse of CDF,: 




CDF- < 1 (,) = Ax 1 ^) , Cmzl( x ) = ^2P-; 2 (x + {l-x)P n/2 {\A 2 )) 

- CDF, (*) - l^-x^ , x >, ■ W 

where P"^ i s the inverse of the normalised lower incomplete Gamma function. This 
function is implemented in special functions libraries such as C++ Boost: :math. Ran- 
dom vectors Q distributed according to p may now be generated in the following way: 
First generate a vector Q' according to an n-dimensional normal distribution. Then 
pick a uniformly distributed random variable x E [0,1] and set R = CDF j5 " 1 (x). The 
variable R is then distributed according to p. The vector Q with the correct random 
distribution is Q = (P/|Q'|)Q'. 



16 



In this form, the integral is well-suited for adaptive integration by standard im- 
plementations of the VEGAS algorithm. Starting with a uniform distribution, we let 
VEGAS adapt the distribution of the variable x to improve the efficiency of the integra- 
tion. Currently, rm/Fitter does not use adaptation for any other integration variables, 
but this may change in the future. As mentioned earlier, the VEGAS implementa- 
tion used by rm/Fitter is the the OmniComp/Dvegas package [ ] by Nikolas Kauer and 
supports parallelised integration. 

6 Conclusions 

To fully exploit the the constraints that the LHC (and other) experiments impose on 
models of new physics, reliable methods for the computation of p- values in (frequentist) 
statistical analyses are needed. There are many realistic situations in which the well- 
known analytical expressions relating \ 2 or A% 2 values and p- values are not (necessarily) 
valid. These include the comparison of non-nested models and fits where systematic 
errors are treated within the RFit scheme. In these situations one has to resort to 
numerical methods. 

Numerical computations of p- values are essentially a Monte-Carlo integration with an 
integrand function that is (usually) similar to a multidimensional Gaussian distribution 
times a 9 function which is zero in some inner region around the maximum of the 
Gaussian distribution. For small p-values, the inner region is very large and the Monte- 
Carlo integration becomes inefficient unless the sampling density p is chosen in a way 
that avoids the inner region. In this paper I have shown how to construct suitable 
sampling densities p in a very generic setting. These concepts were implemented in 
a C++ framework for maximum likelihood fits called rm/Fitter, which is available for 
download at Hepforge [13]. 

The methods presented here and their implementation in myFitter have recently 
been applied in a in a global analysis [ ] of the Standard Model with a fourth generation 
of fermions, where the analytical computation of p- values is impossible due to the non- 
decoupling nature of the fourth generation fermions. 
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